Fix test_train to not rely on 'Sanity Checking' stdout in multi-GPU runs
#10478
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master   #10478      +/-   ##
==========================================
- Coverage   86.11%   85.09%   -1.02%
==========================================
  Files         496      510      +14
  Lines       33655    35962    +2307
==========================================
+ Hits        28981    30601    +1620
- Misses       4674     5361     +687
LGTM, @akihironitta to merge
LGTM, cc @akihironitta
Fix test_train to not rely on 'Sanity Checking' stdout in multi-GPU runs (pyg-team#10478)

The test `test_train` previously asserted on the presence of the "Sanity Checking" message in stdout. This was brittle because in multi-GPU/DistributedDataParallel runs, **only rank 0 prints this message**, so tests running on other ranks failed.

This PR updates the test to:

- Remove the fragile stdout assertion.
- Assert trainer state (`!trainer.sanity_checking`, `current_epoch >= 0`).
- Use LoggerCallback to verify that both training and validation ran.

This makes the test deterministic and robust across single-GPU, multi-GPU, and CI environments.

[PassingLog.TXT](https://github.com/user-attachments/files/22671419/PassingLog.TXT)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Rishi Puri <[email protected]>
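The pattern described above, asserting trainer state and callback-recorded events instead of parsing rank-dependent stdout, can be sketched with small stub classes. Note that `StubTrainer` and the `LoggerCallback` here are illustrative stand-ins, not the actual Lightning or PyG test code:

```python
class LoggerCallback:
    """Records which phases actually ran, instead of parsing stdout.

    Stdout-based checks break under DDP because only rank 0 prints the
    "Sanity Checking" banner; callback hooks fire on every rank.
    """
    def __init__(self):
        self.train_batches = 0
        self.val_batches = 0

    def on_train_batch_end(self):
        self.train_batches += 1

    def on_validation_batch_end(self):
        self.val_batches += 1


class StubTrainer:
    """Minimal stand-in for a trainer: sanity check, then one epoch."""
    def __init__(self, callback):
        self.callback = callback
        self.sanity_checking = False
        self.current_epoch = 0

    def fit(self):
        # Sanity-check phase: rank 0 would print "Sanity Checking" here,
        # but other ranks would not -- so the test must not rely on it.
        self.sanity_checking = True
        self.callback.on_validation_batch_end()
        self.sanity_checking = False
        # One training epoch with two batches, then validation.
        for _ in range(2):
            self.callback.on_train_batch_end()
        self.callback.on_validation_batch_end()
        self.current_epoch = 1


def test_train():
    callback = LoggerCallback()
    trainer = StubTrainer(callback)
    trainer.fit()
    # Rank-agnostic assertions on trainer state, no stdout parsing:
    assert not trainer.sanity_checking
    assert trainer.current_epoch >= 0
    # Callback evidence that both training and validation actually ran:
    assert callback.train_batches > 0
    assert callback.val_batches > 0


test_train()
print("ok")
```

Because every rank executes the hooks and reaches the same trainer state, this style of assertion is deterministic whether the test runs on one GPU, several, or a CPU-only CI runner.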